Cyclistic is a hypothetical bike-sharing company based in Chicago. Cyclistic has two membership types, casual and annual. In the past, Cyclistic’s marketing team has been targeting new members in order to increase the awareness of the company within communities. Now, the marketing team believes that the next step to take in order to grow the company to the next level is to convert the casual members into annual members.
In order to convert the casual members into annual members, I decided to analyze the difference in bike usage between the two member types and use the discovered insights to build effective strategies for the conversion. To do so, I asked and answered several questions.
In order to answer the questions, I downloaded one year worth of the Cyclistic members’ trip data (July 2021 - June 2022) from the Cyclistic database. The dataset can be downloaded from here.
Since the dataset has a total of more than a million rows, I decided to use SQL as my main tool for the analysis. However, before using SQL, I quickly inspected the data with Excel, checking things such as the attributes, the data layout, the data size and potential room for error.
Then, I created a table in PostgreSQL and imported the 12 csv files into the table.
-- Creating the table for the csv files.
CREATE TABLE cyclistic.trip_data (
ride_id text NULL,
rideable_type VARCHAR(15) NULL,
started_at timestamp NULL,
ended_at timestamp NULL,
start_station_name varchar(100) NULL,
start_station_id varchar(100) NULL,
end_station_name varchar(100) NULL,
end_station_id varchar(100) NULL,
start_lat float NULL,
start_lng float NULL,
end_lat float NULL,
end_lng float NULL,
member_casual char(6) NULL
);
-- Importing csv files into the 'trip_data' table. Do this for all 12 csv files.
COPY trip_data(
ride_id,
rideable_type,
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
start_lat,
start_lng,
end_lat,
end_lng,
member_casual
)
FROM '/Users/bruce/Desktop/case_studies/cyclistic/cyclistic_data_original/cyclistic_data_202107.csv'
DELIMITER ','
CSV HEADER;
With this initial setup, I dived into answering the questions.
In order to find the average ride duration for each member type, I first had to calculate the ride duration of each trip. I did this by subtracting the ‘started_at’ column from the ‘ended_at’ column.
CREATE TEMP TABLE trip_data_v01 AS(
SELECT *, (ended_at - started_at) AS ride_duration
FROM trip_data
);
Upon inspection, I found some negative values in the ‘ride_duration’ column, which were errors, since there cannot be negative time value in ride duration. So I deleted the rows with negative ride duration values from the table.
DELETE FROM trip_data_v01
WHERE CAST(ride_duration AS text) LIKE '-%';
Then I queried to get the average ride duration of each member type.
-- Finding the average ride duration for each member type.
SELECT AVG(ride_duration) AS avg_ride_duration, member_casual AS member_type
FROM cyclistic.trip_data_v01
GROUP BY member_casual;
| avg_ride_duration | member_type |
|---|---|
| 00:29:49.144628 | casual |
| 00:12:58.955735 | member |
By answering Question 1, I learned that, on average, the causal members rode the bikes more than twice as longer as the annual members.
Inspecting the dataset, I realized that I did not have the data for the actual distance traveled on the bike. However, I did have the latitude and the longitude values of the start and the end of points of the ride. So, I decided to use the distance between the start and the end points as the metrics for the distance traveled.
In order to calculate the distance between the two points using the
latitude and the longitude values, I used the extension
cube and earthdistance.
CREATE EXTENSION cube;
CREATE EXTENSION earthdistance;
-- Finding the average distance between start and end points for each member type.
SELECT
AVG((point(start_lng, start_lat) <@> point(end_lng, end_lat))) FILTER(WHERE member_casual = 'casual') AS casual_start_end_distance_in_mi,
AVG((point(start_lng, start_lat) <@> point(end_lng, end_lat))) FILTER(WHERE member_casual = 'member') AS annual_start_end_distance_in_mi
FROM cyclistic.trip_data_v01;
| casual_start_end_distance_in_mi | annual_start_end_distance_in_mi |
|---|---|
| 1.411926 | 1.297408 |
By answering Question 2, I learned that there wasn’t much
difference in the average distances between the start and end points of
the ride of the casual members and the annual members. On average, the
casual members traveled just 8 percent further than the annual
members.
In order to figure out on which day the members preferred to ride the bikes, I needed to assign the appropriate day to every observation in the dataset. So I created a new column named “day.” Then, I queried the new table and converted the output values to percentages for fair comparison between the two member types.
-- Creating a temp table with 'day' columnn.
CREATE TEMP TABLE trip_data_v02 AS -- Creating a temp table with the 'day' columnn.
SELECT *, to_char(started_at, 'Day') AS day
FROM trip_data_v01;
-- Finding each member type's day preference for using the bike.
SELECT member_casual AS member_type, day, COUNT(day)
FROM cyclistic.trip_data_v02
GROUP BY day, member_casual
ORDER BY member_casual, day;
| member_type | day | count |
|---|---|---|
| casual | Friday | 363100 |
| casual | Monday | 303997 |
| casual | Saturday | 535565 |
| casual | Sunday | 467081 |
| casual | Thursday | 325171 |
| casual | Tuesday | 277765 |
| casual | Wednesday | 285483 |
| member | Friday | 469253 |
| member | Monday | 469039 |
| member | Saturday | 444150 |
| member | Sunday | 398919 |
| member | Thursday | 525932 |
| member | Tuesday | 518239 |
| member | Wednesday | 516545 |
By answering Question 3, I learned that the casual members used
the bikes more on the weekends, whereas the annual members used the
bikes more during the weekdays, but pretty steadily throughout the whole
week.
For this question, I used hours as the metrics and defined the hours by counting every ride 30 minutes before and 29 minutes after the actual hours. For example, “12AM” was defined by counting every ride between 11:30 PM and 12:29 AM. Then, again, the output values were converted to percentages for fair comparison.
-- Finding the usage of bike per time of the day for each member type.
WITH temp_table AS(
SELECT member_casual, CAST(started_at AS time)
FROM cyclistic.trip_data_v02
)
SELECT member_casual AS member_type,
COUNT(started_at) FILTER (WHERE started_at >= ('23:30:00'::time) OR started_at <('00:30:00'::time)) AS "12AM",
COUNT(started_at) FILTER (WHERE started_at >= ('00:30:00'::time) AND started_at <('01:30:00'::time)) AS "1AM",
COUNT(started_at) FILTER (WHERE started_at >= ('01:30:00'::time) AND started_at <('02:30:00'::time)) AS "2AM",
COUNT(started_at) FILTER (WHERE started_at >= ('02:30:00'::time) AND started_at <('03:30:00'::time)) AS "3AM",
COUNT(started_at) FILTER (WHERE started_at >= ('03:30:00'::time) AND started_at <('04:30:00'::time)) AS "4AM",
COUNT(started_at) FILTER (WHERE started_at >= ('04:30:00'::time) AND started_at <('05:30:00'::time)) AS "5AM",
COUNT(started_at) FILTER (WHERE started_at >= ('05:30:00'::time) AND started_at <('06:30:00'::time)) AS "6AM",
COUNT(started_at) FILTER (WHERE started_at >= ('06:30:00'::time) AND started_at <('07:30:00'::time)) AS "7AM",
COUNT(started_at) FILTER (WHERE started_at >= ('07:30:00'::time) AND started_at <('08:30:00'::time)) AS "8AM",
COUNT(started_at) FILTER (WHERE started_at >= ('08:30:00'::time) AND started_at <('09:30:00'::time)) AS "9AM",
COUNT(started_at) FILTER (WHERE started_at >= ('09:30:00'::time) AND started_at <('10:30:00'::time)) AS "10AM",
COUNT(started_at) FILTER (WHERE started_at >= ('10:30:00'::time) AND started_at <('11:30:00'::time)) AS "11AM",
COUNT(started_at) FILTER (WHERE started_at >= ('11:30:00'::time) AND started_at <('12:30:00'::time)) AS "12PM",
COUNT(started_at) FILTER (WHERE started_at >= ('12:30:00'::time) AND started_at <('13:30:00'::time)) AS "1PM",
COUNT(started_at) FILTER (WHERE started_at >= ('13:30:00'::time) AND started_at <('14:30:00'::time)) AS "2PM",
COUNT(started_at) FILTER (WHERE started_at >= ('14:30:00'::time) AND started_at <('15:30:00'::time)) AS "3PM",
COUNT(started_at) FILTER (WHERE started_at >= ('15:30:00'::time) AND started_at <('16:30:00'::time)) AS "4PM",
COUNT(started_at) FILTER (WHERE started_at >= ('16:30:00'::time) AND started_at <('17:30:00'::time)) AS "5PM",
COUNT(started_at) FILTER (WHERE started_at >= ('17:30:00'::time) AND started_at <('18:30:00'::time)) AS "6PM",
COUNT(started_at) FILTER (WHERE started_at >= ('18:30:00'::time) AND started_at <('19:30:00'::time)) AS "7PM",
COUNT(started_at) FILTER (WHERE started_at >= ('19:30:00'::time) AND started_at <('20:30:00'::time)) AS "8PM",
COUNT(started_at) FILTER (WHERE started_at >= ('20:30:00'::time) AND started_at <('21:30:00'::time)) AS "9PM",
COUNT(started_at) FILTER (WHERE started_at >= ('21:30:00'::time) AND started_at <('22:30:00'::time)) AS "10PM",
COUNT(started_at) FILTER (WHERE started_at >= ('22:30:00'::time) AND started_at <('23:30:00'::time)) AS "11PM"
FROM temp_table
GROUP BY member_casual;
| member_type | 12AM | 1AM | 2AM | 3AM | 4AM | 5AM | 6AM | 7AM | 8AM | 9AM | 10AM | 11AM | 12PM | 1PM | 2PM | 3PM | 4PM | 5PM | 6PM | 7PM | 8PM | 9PM | 10PM | 11PM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| casual | 61936 | 42461 | 31424 | 17758 | 11233 | 10005 | 18892 | 39396 | 64379 | 73464 | 91575 | 121560 | 153593 | 165915 | 174302 | 184622 | 200081 | 228076 | 231799 | 194049 | 142092 | 110909 | 101729 | 86912 |
| member | 45898 | 27641 | 18455 | 9705 | 7533 | 15502 | 58520 | 129186 | 199778 | 175106 | 135886 | 149397 | 186537 | 190571 | 186074 | 197941 | 247056 | 331833 | 321801 | 247004 | 170438 | 123980 | 96233 | 70002 |
By answering Question 4, I learned that, for the casual members,
there was a steady rise in the use of the bikes from 5AM to 6PM and a
steady fall after that. On the other hand, for the annual members, there
were two noticeable peaks at 8AM and 5PM.
The four seasons, spring, summer, fall and winter, were the metrics that I used to answer this question. The four seasons were defined as such: Spring, March to May, Summer, June to August, Fall, September to November, and Winter, December to February.
--Finding each member type's seasonal preference for riding the bikes.
WITH temp_table AS(
SELECT member_casual, CAST(started_at AS date)
FROM cyclistic.trip_data_v02
)
SELECT member_casual AS member_type,
COUNT(started_at) FILTER (WHERE started_at >= ('2022-03-01'::date) AND started_at <= ('2022-05-31'::date)) AS "spring",
COUNT(started_at) FILTER (WHERE started_at >= ('2022-06-01'::date) OR started_at <= ('2021-08-31'::date)) AS "summer",
COUNT(started_at) FILTER (WHERE started_at >= ('2021-09-01'::date) AND started_at <= ('2021-11-30'::date)) AS "fall",
COUNT(started_at) FILTER (WHERE started_at >= ('2021-12-01'::date) AND started_at <= ('2022-02-28'::date)) AS "winter"
FROM temp_table
GROUP BY member_casual;
| member_type | spring | summer | fall | winter |
|---|---|---|---|---|
| casual | 496711 | 1223754 | 728023 | 109674 |
| member | 793435 | 1172158 | 1019239 | 357245 |
By answering Question 5, I learned that the order of the seasonal popularity for riding the bikes were the same for the two member types. It was summer, fall, spring and winter in the order of most popular to least popular. However, I noticed that the annual members had more evenly distributed seasonal preferences for riding the bikes than the casual members
There were 3 bike types for the members to choose from: classic bike, electric bike and docked bike.
-- Finding bike type preferences for each member type.
SELECT member_casual AS member_type, rideable_type, COUNT(rideable_type)
FROM cyclistic.trip_data_v02
GROUP BY rideable_type, member_casual;
| member_type | rideable_type | count |
|---|---|---|
| casual | classic_bike | 1218261 |
| member | classic_bike | 1971040 |
| casual | docked_bike | 253366 |
| casual | electric_bike | 1086535 |
| member | electric_bike | 1371037 |
By answering Question 6, I learned that the annual members do not ride docked bikes. I also learned that both member types preferred classic bikes to electric bikes. However, electric bikes were still a popular option.
For this question, I wanted to see the geographical density of the popular areas for riding the bikes. To solve this problem, I used the counts of the start and the end points of the ride as the metrics for popularity. I extracted two csv files from the database, one with the start point information and the other with the end point information. I limited the data to those with the counts that were bigger than 10. This was done to limit the data size by getting rid of the relatively insignificant data points, so that Tableau could process the data without crashing.
WITH temp_table AS(
SELECT member_casual, start_station_name, start_lat, start_lng, count(*) AS start_point_count
FROM cyclistic.trip_data_v02
GROUP BY member_casual, start_station_name, start_lat, start_lng
ORDER BY member_casual, start_point_count DESC
)
SELECT *
FROM temp_table
WHERE start_point_count > 10; /* This condition is for limiting the size of the data,
eliminating the relatively insignificant points under count 10,
so that Tableau can process the data without crashing. */
| member_casual | start_station_name | start_lat | start_lng | start_point_count |
|---|---|---|---|---|
| casual | Streeter Dr & Grand Ave | 41.89228 | -87.61204 | 53115 |
| casual | DuSable Lake Shore Dr & Monroe St | 41.88096 | -87.61674 | 23038 |
| casual | DuSable Lake Shore Dr & North Blvd | 41.91172 | -87.62680 | 19390 |
| casual | Theater on the Lake | 41.92628 | -87.63083 | 16892 |
| casual | Michigan Ave & Oak St | 41.90096 | -87.62378 | 16663 |
| casual | Millennium Park | 41.88103 | -87.62408 | 15692 |
| casual | Wells St & Concord Ln | 41.91213 | -87.63466 | 13191 |
| casual | Clark St & Lincoln Ave | 41.91569 | -87.63460 | 12251 |
| casual | Clark St & Armitage Ave | 41.91831 | -87.63628 | 11767 |
| casual | Dusable Harbor | 41.88698 | -87.61281 | 11291 |
WITH temp_table AS(
SELECT member_casual, end_station_name, end_lat, end_lng, count(*) AS end_point_count
FROM cyclistic.trip_data_v02
GROUP BY member_casual, end_station_name, end_lat, end_lng
ORDER BY member_casual, end_point_count DESC
)
SELECT *
FROM temp_table
WHERE end_point_count > 10; /* This condition is for limiting the size of the data,
eliminating the relatively insignificant points under count 10,
so that Tableau can process the data without crashing. */
| member_casual | end_station_name | end_lat | end_lng | end_point_count |
|---|---|---|---|---|
| casual | Streeter Dr & Grand Ave | 41.89228 | -87.61204 | 61009 |
| casual | DuSable Lake Shore Dr & North Blvd | 41.91172 | -87.62680 | 25218 |
| casual | DuSable Lake Shore Dr & Monroe St | 41.88096 | -87.61674 | 24680 |
| casual | Michigan Ave & Oak St | 41.90096 | -87.62378 | 20465 |
| casual | Theater on the Lake | 41.92628 | -87.63083 | 19892 |
| casual | Millennium Park | 41.88103 | -87.62408 | 19458 |
| casual | Wells St & Concord Ln | 41.91213 | -87.63466 | 15182 |
| casual | Clark St & Lincoln Ave | 41.91569 | -87.63460 | 13794 |
| casual | Clark St & Armitage Ave | 41.91831 | -87.63628 | 13738 |
| casual | Shedd Aquarium | 41.86723 | -87.61536 | 13154 |
By answering Question 7, I learned that the casual members tend to ride the bikes along the coastlines more than the annual members, and I also learned that the annual members’ rides were relatively more diverse and spread out into the inland in comparison to the casual members’ rides.
Even though the casual members only traveled 8 percent further than the annual members on average, it took the casual members more than twice as longer as the annual members to finish the trips. This meant that the casual members roamed around a lot more than the annual members. I speculated that this is because the casual members are more likely to use the bikes for leisurely purposes and the annual members are more likely to use the bikes as a means of efficient transportation. This claim was additionally supported by the facts that the bike usage was popular among the annual members during the commute times and during the weekdays and that the casual members used the bikes more on the weekends along the coastlines where many people visit for leisurely purposes.
Another speculation for why the casual members use the bikes longer than the annual members on average is that the casual members feel that riding the bikes for a longer period of time is a better deal for them because they have to pay the initial fee that takes up a big portion of their total fares and they want to make sure that they get their money’s worth. Unlike the casual members, the annual members do not have to pay fares when they use the bikes because they had already paid for their annual memberships up front.
Based on the data analysis, I came up with strategies for converting the casual members into annual members: